Joins on Encoded and Partitioned Data
Authors
Abstract
Compression has historically been used to reduce the cost of storage, I/Os from that storage, and buffer pool utilization, at the expense of the CPU required to decompress data every time it is queried. However, significant additional CPU efficiencies can be achieved by deferring decompression as late in query processing as possible and performing query processing operations directly on the still-compressed data. In this paper, we investigate the benefits and challenges of performing joins on compressed (or encoded) data. We demonstrate the benefit of independently optimizing the compression scheme of each join column, even though join predicates relating values from multiple columns may require translation of the encoding of one join column into the encoding of the other. We also show the benefit of compressing “payload” data other than the join columns “on the fly,” to minimize the size of hash tables used in the join. By partitioning the domain of each column and defining separate dictionaries for each partition, we can achieve even better overall compression as well as increased flexibility in dealing with new values introduced by updates. Instead of decompressing both join columns participating in a join to resolve their different compression schemes, our system performs a light-weight mapping of only qualifying rows from one of the join columns to the encoding space of the other at run time. Consequently, join predicates can be applied directly on the compressed data. We call this procedure encoding translation. Two alternatives of encoding translation are developed and compared in the paper. We provide a comprehensive evaluation of these alternatives using product implementations of each on the TPC-H data set, and demonstrate that performing joins on encoded and partitioned data achieves both superior performance and excellent compression.
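The encoding translation idea described in the abstract can be illustrated with a minimal sketch. It assumes each join column has its own simple dictionary mapping distinct values to dense integer codes, and that the join is an equi-join. The function names (`build_dictionary`, `translation_map`, `encoded_hash_join`) are hypothetical and chosen for illustration; this is not the paper's implementation, which works with partitioned dictionaries inside a product engine.

```python
from collections import defaultdict


def build_dictionary(values):
    """Assign a dense integer code to each distinct value of one column."""
    return {v: code for code, v in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each value with its code from the column's own dictionary."""
    return [dictionary[v] for v in values]


def translation_map(probe_dict, build_dict):
    """Map probe-side codes to build-side codes for values present in both.

    A probe value missing from the build dictionary gets no entry,
    because it cannot match any build row.
    """
    return {
        p_code: build_dict[v]
        for v, p_code in probe_dict.items()
        if v in build_dict
    }


def encoded_hash_join(build_codes, probe_codes, xlate):
    """Equi-join on codes only; returns (build_row, probe_row) index pairs."""
    table = defaultdict(list)
    for row, code in enumerate(build_codes):
        table[code].append(row)

    result = []
    for row, code in enumerate(probe_codes):
        b_code = xlate.get(code)  # translate into the build encoding
        if b_code is None:
            continue  # value does not exist on the build side
        for b_row in table.get(b_code, ()):
            result.append((b_row, row))
    return result


# Illustrative data (assumed): the two columns are encoded independently,
# so "bob" has code 1 on the customer side and code 0 on the order side.
customers = ["carol", "alice", "bob"]
orders = ["bob", "dave", "bob", "zed"]

cust_dict = build_dictionary(customers)
order_dict = build_dictionary(orders)
xlate = translation_map(order_dict, cust_dict)

pairs = encoded_hash_join(
    encode(customers, cust_dict), encode(orders, order_dict), xlate
)
print(pairs)  # [(2, 0), (2, 2)]
```

Neither column is decoded back to its original values during the join: only the probe-side codes are mapped into the build side's code space, and probe values with no counterpart are dropped before the hash lookup.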
Similar resources
Memory-Efficient Hash Joins
We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded populatio...
An Evaluation of Non-Equijoin Algorithms
A non-equijoin of relations R and S is a band join if the join predicate requires values in the join attribute of R to fall within a specified band about the values in the join attribute of S. We propose a new algorithm, termed a partitioned band join, for evaluating band joins. We present a comparison between the partitioned band join algorithm and the classical sort-merge join algorithm (optim...
Scalable and Efficient Self-Join Processing technique in RDF data
Efficient management of RDF data plays an important role in understanding and quickly querying data. Although current approaches to indexing RDF triples, such as property tables and vertical partitioning, have solved many issues, they still perform poorly on complex self-join queries and on inserting data into the same table. As an improvement in this paper, we prop...
BlockJoin: Efficient Matrix Partitioning Through Joins
Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii)...
Executing Web Application Queries on a Partitioned Database
Partitioning data over multiple storage servers is an attractive way to increase throughput for web-like workloads. However, there is often no one partitioning that yields good performance for all queries, and it can be challenging for the web developer to determine how best to execute queries over partitioned data. This paper presents DIXIE, a SQL query planner, optimizer, and executor for dat...
Journal:
- PVLDB
Volume 7, Issue
Pages -
Publication date 2014